Closed Bug 1698184 Opened 4 years ago Closed 3 years ago

FOG Dynamic Metric Implementation (Build Faster/Artifact Build/Web Extensions)

Categories

(Toolkit :: Telemetry, task, P2)

task

Tracking

()

RESOLVED FIXED
103 Branch
Tracking Status
firefox103 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

Details

(Whiteboard: [telemetry:fog:m?])

Attachments

(12 files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Following the design from bug 1662863, implement, test, and document dynamic metric support for FOG.

(This may end up being a metabug)

This covers:

  • Supporting automatic dynamic registration of metrics for use by Web Extensions and Build Faster/Artifact Builds
  • Tests
  • Documentation to act as the authoritative and living evolution of the design doc

This does not cover:

  • Manual dynamic registration at runtime of metrics not described in a metrics.yaml

Picking this up to look into it again. Additional uses of such a mechanism now include

  • Dynamic Telemetry: where some yet-to-be-designed system on a server sends configurations to a yet-to-be-designed system in Firefox Desktop which will need to register new metrics at runtime

Scope-wise we'll obviously not be supporting Dynamic Telemetry until it exists, but we should ensure what we build can be adapted without too much difficulty in that sort of direction. I've also decided to skip writing in support for Web Extensions in this first implementation (I'll file a follow-up for it) to keep work from ballooning. (This is because extensions that want to use the builtin Glean to report information haven't manifested in the past year, with most using Glean.js because it's just that easy).

Looks as though a skeleton will look like:

  • t/c/g/moz.build - If we don't have CONFIG['COMPILE_ENVIRONMENT'] we can trigger a build script to generate the necessary runtime metrics files in a runtime-accessible location. (If we do have CONFIG['COMPILE_ENVIRONMENT'] we should delete same).
    • Likely does not support Build Faster where we have a compile environment, but we don't want to use it.
  • t/c/g/build_scripts/ - New script to gather the various and several metrics.yaml and pings.yaml files and put them in a accessible location (cf. generate_JSON_definitions in t/c/g/build_scripts/gen_{scalar|histogram|event}_data.py)
    • Should be testable in t/c/g/tests/pytest
  • t/c/g/bindings/Glean{Pings}.cpp - The NamedGetter is the first necessary voluntary call into the Glean or GleanPings global that presages the use of a metric or ping that may have been added since the last compile. This is the latest we can guarantee that we're still before the first use of an artefact-build-only metric or ping. Here's where we'll need to load those runtime metrics and pings files
    • This is on the main thread. Which sucks.
    • We'll need to do this I/O and processing synchronously. Which also sucks.
    • However, we can limit this to local developer builds by checking !MOZILLA_OFFICIAL (like we do to determine whether to disable upload)
      • Optionally we could also check MOZ_UPDATE_CHANNEL == "default" like Telemetry does, but we should probably stick to the same criteria we use for disable upload.
  • t/c/g/bindings/artefact/ (we might want a submodule here for organization) It'll own the storage and have APIs for adding new records and retrieval (notice that we need to enumerate names in Category as well as support lookups by fully-qualified name). The Category
    • With an extern "C" API for adding new records, we can write the file I/O and processing in a Rust crate, which is where I'd prefer it.
      • yaml-rust and serde_yaml are already in tree, after all (if we decide to use YAML for the runtime metrics and pings files).

Sprinkle with tests, soak in documentation (developer and user), bake at 314F for 15.926min, and I think we'll be good to go.

Assignee: nobody → chutten
Severity: -- → N/A
Status: NEW → ASSIGNED
Priority: P3 → P2

Nope, I'm wrong. Because of course I didn't say anything about what lies beneath. To be fair, neither does the design. (Who wrote that, anyway)

That'll get us the strings and let us create the C++ metrics instances with fresh metrics ids, but any API calls against those instances will hit FFI and immediately panic when there's nothing in __glean_metric_maps by that id. We could noop things at the C++ layer, but then Artefact Builds that add instrumentation can't run tests, which discourages writing tests, which is a Bad Thing.

So we'll need to go deeper, creating Rust metrics instances and placing them somewhere the FFI macros can find them. These instances will be near-identical to the ones that the Rust consumers will be using by name (meaning we'll be duplicating some memory here. Good thing we're limiting it to developer builds (for now, and to individual runtime-declared metrics in the future)) differing only in the differences that the instrumentor made to the instrumentation and that new metric id.

Why else is this Rust stuff important? That's where IPC lives. And we have enough privileged JS in non-main processes that you bet your buns we'll need it.

In summary, we'll need APIs on the fog crate for registering new at-runtime metrics. The file processing crate will call that API to fill the Rust maps (MetricId -> Metric Instance, per metric type (cf __glean_metric_maps) around the same time it's calling the C API to fill the C++ maps (Fully Qualified Metric Name -> metric_entry_t (packed id + type), Category string list, Metrics string list.).

Labeled Metrics

As always, these are going to be a pain. Not least because MIN_LABELED_SUBMETRIC_ID is taking the "2^27 bit is on" encoding scheme from the design doc. I think we'll give it the 2^26th bit so we can have 2^25 static metrics sharing amongst them another 2^25 submetrics, and 2^25 runtime-defined metrics sharing amongst them a different 2^25 submetrics. 2^25 is still 33.5M, so we should have plenty of space (for now).


Now I think that might cover it. No broad changes to the Design, just some augmentation. But we'll see what happens with the code starts hitting the compiler.

Storing what amounts to being a pointer and a length is kinda weird considering
we only ever operate on the string data it means.

nsDependentCString is a teensy bit larger (stores the string as a pointer
(size_t width) instead of as a uint32_t index, stores data flags in a
uint16_t, and some class flags in another uint16_t. So, altogether, we're
going from 4-byte Category instances to (on systems with 64-bit pointers)
10-byte Category instances.) but it:

  • Gives us better methods (makes it harder for me to get my pointer
    arithmetic wrong)
  • Is more flexible, which we'll need when we start using strings
    that aren't in the global category string table

Depends on D143044

Attachment #9271077 - Attachment description: WIP: Bug 1698184 - Move submetrics from 2^27 to 2^26 to make room for dynamic metrics r?janerik! → Bug 1698184 - Move submetrics 2^25 to make room for dynamic metrics r?janerik!
Attachment #9271078 - Attachment description: WIP: Bug 1698184 - Upgrade Category to store a proper string type r?janerik! → Bug 1698184 - Upgrade Category to store a proper string type r?janerik!
Attachment #9271079 - Attachment description: WIP: Bug 1698184 - Add missing headers revealed by changed include order r?janerik! → Bug 1698184 - Add missing headers revealed by changed include order r?janerik!

Not a lot works in non-main processes, but having an ability to, from xpcshell,
register metrics at runtime would simplify testing more than a smidge.

Depends on D143046

We're gonna try our best to keep most of the JOG stuff in bindings/jog.

We won't always be able to. For example:
In order for the created metrics instances to be available to e.g. FFI,
we'll have to codegen them into the fog crate.

Pieces include

  • A script for collecting all the ids for metric types
  • A template for generating the factory that can build each of those types
  • mozbuild integration

We go from one place where metrics can be (the static maps) to two
(we throw in the factory's dynamic maps (warehouses?))

(( Actually from two to three when you consider the submetric maps ))

Thanks to how we architected things, the necessary macro changes are nice and
small.

I feel as though there's a refactor where both IPC and FFI can reuse the same
macros, but for now this is a fairly copy-pasta-heavy change.

Depends on D143049

This provides the necessary C++ glue between the Rust factory (and its FFI) and
the JS layer that'll be using this, and also the necessary storage.

Adds a test-only method to JS that permits the runtime registration of metrics.
Also uses that to cover JOG with tests: registering and smoke-testing metrics
of each metric type.
(Events being a notable (temporary) exception)

I don't like writing parsers, but here I am: writing parsers.
This is not designed to be ergonomic, just complete.
No one should be using this except to test the internals, so that's okay:
we're only hurting ourselves : |

Depends on D143051

These patches aren't a complete solution, so there'll be some follow-ups to file. Specific notable omissions presently include:

  • event metrics (due to the list of allowed extra keys needing to be static in the SDK at present)
  • Pings
  • Artefact Build Support
    • Collecting, collating, writing existing metrics definitions to a single file in the objdir when there's no COMPILE_ENVIRONMENT
    • (Removing ^ when there is a COMPILE_ENVIRONMENT to ensure we don't register at runtime things we codegen'd at build time)
    • Reading ^ synchronously on the main thread when the first metric is looked retrieved in JS
    • Registering the metrics so read.
  • Tests and docs for all (especially #defines in use)
Depends on: 1767037
Blocks: 1767037
No longer depends on: 1767037
Blocks: 1767039
Blocks: 1767040
Blocks: 1767055
Depends on: 1772156
Attachment #9271077 - Attachment description: Bug 1698184 - Move submetrics 2^25 to make room for dynamic metrics r?janerik! → Bug 1698184 - Move submetrics to 2^25 to make room for dynamic metrics r?janerik!
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/70a4a8e1c94e Move submetrics to 2^25 to make room for dynamic metrics r=janerik https://hg.mozilla.org/integration/autoland/rev/bcb8d3298895 Upgrade Category to store a proper string type r=janerik https://hg.mozilla.org/integration/autoland/rev/b890543b9677 Add missing headers revealed by changed include order r=janerik https://hg.mozilla.org/integration/autoland/rev/13f5ec04f25b Ensure Services.fog can be reached in non-parent processes r=janerik https://hg.mozilla.org/integration/autoland/rev/1b95ed4b63a9 Set out a skeleton of the JOG system r=janerik https://hg.mozilla.org/integration/autoland/rev/a21f4220cf2a Build a factory for runtime-defined metrics r=janerik https://hg.mozilla.org/integration/autoland/rev/202051f1fee2 Augment FOG's FFI to support runtime-defined metrics r=janerik https://hg.mozilla.org/integration/autoland/rev/5979019fc8b3 Support runtime-registered metrics in FOG IPC r=janerik https://hg.mozilla.org/integration/autoland/rev/b4c517c88073 Permit the registration of FOG metrics at runtime, and their use r=janerik https://hg.mozilla.org/integration/autoland/rev/752f332862e3 Test runtime registration of FOG metrics r=janerik https://hg.mozilla.org/integration/autoland/rev/6d33f8bea52a Add docs for JOG r=janerik https://hg.mozilla.org/integration/autoland/rev/8c7c5b07fdff Fix local Glean SDK instruction typo r=janerik
Regressions: 1776386

I'm seeing "TODO: implement" being printed when doing an artifact build on macOS. Is it possible that's coming from https://searchfox.org/mozilla-central/rev/5644fae86d5122519a0e34ee03117c88c6ed9b47/toolkit/components/glean/build_scripts/glean_parser_ext/jog.py#99 , and is there already a bug on file to track, uh, implementing that? And/or shutting up the warning? :)

Flags: needinfo?(jrediger)

(In reply to :Gijs (he/him) from comment #18)

I'm seeing "TODO: implement" being printed when doing an artifact build on macOS. Is it possible that's coming from https://searchfox.org/mozilla-central/rev/5644fae86d5122519a0e34ee03117c88c6ed9b47/toolkit/components/glean/build_scripts/glean_parser_ext/jog.py#99 , and is there already a bug on file to track, uh, implementing that? And/or shutting up the warning? :)

Probably bug 1767055. chutten's back next week.
Is it only printed once or multiple times? If really too annoying we can get rid of that print.

Flags: needinfo?(jrediger)

(In reply to Jan-Erik Rediger [:janerik] from comment #19)

(In reply to :Gijs (he/him) from comment #18)

I'm seeing "TODO: implement" being printed when doing an artifact build on macOS. Is it possible that's coming from https://searchfox.org/mozilla-central/rev/5644fae86d5122519a0e34ee03117c88c6ed9b47/toolkit/components/glean/build_scripts/glean_parser_ext/jog.py#99 , and is there already a bug on file to track, uh, implementing that? And/or shutting up the warning? :)

Probably bug 1767055. chutten's back next week.
Is it only printed once or multiple times?

AFAICT only once, and only when doing the complete build (./mach build), not for subsequent ./mach build faster.

If really too annoying we can get rid of that print.

It's not so much annoying as a bit mystifying. It would be nice to have a more meaningful message, especially if there are consequences to the resulting build that are developer-relevant (e.g. are glean metrics going to be broken?).

(In reply to :Gijs (he/him) from comment #20)

(In reply to Jan-Erik Rediger [:janerik] from comment #19)

If really too annoying we can get rid of that print.

It's not so much annoying as a bit mystifying. It would be nice to have a more meaningful message, especially if there are consequences to the resulting build that are developer-relevant (e.g. are glean metrics going to be broken?).

Yeah, I think that was an oversight when this landed just before chutten went out. He's back next week, so probably we can wait until then.
Either by then we make that message nicer, remove it or we'll land the full work really

Whoopsie. Sorry about that. It shouldn't be printed at all and I missed it in the logspam. bug 1767055 is indeed where that'll be implemented, but it's a big'un so lemme file a small'n'quick bug to remove this print in the meantime.

Regressions: 1784551
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: